mean teacher
Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
The recently proposed Temporal Ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks. It maintains an exponential moving average of label predictions on each training example, and penalizes predictions that are inconsistent with this target. However, because the targets change only once per epoch, Temporal Ensembling becomes unwieldy when learning large datasets. To overcome this problem, we propose Mean Teacher, a method that averages model weights instead of label predictions. As an additional benefit, Mean Teacher improves test accuracy and enables training with fewer labels than Temporal Ensembling. Without changing the network architecture, Mean Teacher achieves an error rate of 4.35% on SVHN with 250 labels, outperforming Temporal Ensembling trained with 1000 labels. We also show that a good network architecture is crucial to performance. Combining Mean Teacher and Residual Networks, we improve the state of the art on CIFAR-10 with 4000 labels from 10.55% to 6.28%, and on ImageNet 2012 with 10% of the labels from 35.24% to 9.11%.
Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
The recently proposed Temporal Ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks. It maintains an exponential moving average of label predictions on each training example, and penalizes predictions that are inconsistent with this target. However, because the targets change only once per epoch, Temporal Ensembling becomes unwieldy when learning large datasets. To overcome this problem, we propose Mean Teacher, a method that averages model weights instead of label predictions. As an additional benefit, Mean Teacher improves test accuracy and enables training with fewer labels than Temporal Ensembling. Without changing the network architecture, Mean Teacher achieves an error rate of 4.35% on SVHN with 250 labels, outperforming Temporal Ensembling trained with 1000 labels. We also show that a good network architecture is crucial to performance. Combining Mean Teacher and Residual Networks, we improve the state of the art on CIFAR-10 with 4000 labels from 10.55% to 6.28%, and on ImageNet 2012 with 10% of the labels from 35.24% to 9.11%.
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
- Information Technology > Sensing and Signal Processing > Image Processing (0.93)
- Information Technology > Artificial Intelligence > Vision > Image Understanding (0.46)
Improving Respiratory Sound Classification with Architecture-Agnostic Knowledge Distillation from Ensembles
Toikkanen, Miika, Kim, June-Woo
Respiratory sound datasets are limited in size and quality, making high performance difficult to achieve. Ensemble models help but inevitably increase compute cost at inference time. Soft label training distills knowledge efficiently with extra cost only at training. In this study, we explore soft labels for respiratory sound classification as an architecture-agnostic approach to distill an ensemble of teacher models into a student model. We examine different variations of our approach and find that even a single teacher, identical to the student, considerably improves performance beyond its own capability, with optimal gains achieved using only a few teachers. We achieve the new state-of-the-art Score of 64.39 on ICHBI, surpassing the previous best by 0.85 and improving average Scores across architectures by more than 1.16. Our results highlight the effectiveness of knowledge distillation with soft labels for respiratory sound classification, regardless of size or architecture.
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > Greece > Central Macedonia > Thessaloniki (0.04)
A mean teacher algorithm for unlearning of language models
One of the goals of language model unlearning is to reduce memorization of selected text instances while retaining the model's general abilities. Despite various proposed methods, reducing memorization of large datasets without noticeable degradation in model utility remains challenging. In this paper, we investigate the mean teacher algorithm (Tarvainen & Valpola, 2017), a simple proximal optimization method from continual learning literature that gradually modifies the teacher model. We show that the mean teacher can approximate a trajectory of a slow natural gradient descent (NGD), which inherently seeks low-curvature updates that are less likely to degrade the model utility. While slow NGD can suffer from vanishing gradients, we introduce a new unlearning loss called "negative log-unlikelihood" (NLUL) that avoids this problem. We show that the combination of mean teacher and NLUL improves some metrics on the MUSE benchmarks (Shi et al., 2024).
- Information Technology > Security & Privacy (1.00)
- Law (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)
Reviews: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
The paper proposes a new method for using unlabeled data in semi-supervised learning. The idea is to construct a teacher network from student network during training by using an exponentially decaying moving average of the weights of the student network, updating after each batch. This is inspired by previous work that uses a temporal ensemble of the softmax outputs, and aims to reduce the variance of the targets during training. Noise of various forms is added to both labelled and unlabeled examples, and a L2 penalty is added to encourage the student outputs to be consistent with the teachers. As the authors mention, this acts as a kind of soft adaptive label propagation mechanism. The advantage of their approach over temporal ensembling is that it can be used in the online setting.
Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
Antti Tarvainen, Harri Valpola
The recently proposed Temporal Ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks. It maintains an exponential moving average of label predictions on each training example, and penalizes predictions that are inconsistent with this target. However, because the targets change only once per epoch, Temporal Ensembling becomes unwieldy when learning large datasets. To overcome this problem, we propose Mean Teacher, a method that averages model weights instead of label predictions. As an additional benefit, Mean Teacher improves test accuracy and enables training with fewer labels than Temporal Ensembling. Without changing the network architecture, Mean Teacher achieves an error rate of 4.35% on SVHN with 250 labels, outperforming Temporal Ensembling trained with 1000 labels. We also show that a good network architecture is crucial to performance. Combining Mean Teacher and Residual Networks, we improve the state of the art on CIFAR-10 with 4000 labels from 10.55% to 6.28%, and on ImageNet 2012 with 10% of the labels from 35.24% to 9.11%.
Table-Filling via Mean Teacher for Cross-domain Aspect Sentiment Triplet Extraction
Peng, Kun, Jiang, Lei, Li, Qian, Li, Haoran, Yu, Xiaoyan, Sun, Li, Sun, Shuo, Bi, Yanxian, Peng, Hao
Cross-domain Aspect Sentiment Triplet Extraction (ASTE) aims to extract fine-grained sentiment elements from target domain sentences by leveraging the knowledge acquired from the source domain. Due to the absence of labeled data in the target domain, recent studies tend to rely on pre-trained language models to generate large amounts of synthetic data for training purposes. However, these approaches entail additional computational costs associated with the generation process. Different from them, we discover a striking resemblance between table-filling methods in ASTE and two-stage Object Detection (OD) in computer vision, which inspires us to revisit the cross-domain ASTE task and approach it from an OD standpoint. This allows the model to benefit from the OD extraction paradigm and region-level alignment. Building upon this premise, we propose a novel method named \textbf{T}able-\textbf{F}illing via \textbf{M}ean \textbf{T}eacher (TFMT). Specifically, the table-filling methods encode the sentence into a 2D table to detect word relations, while TFMT treats the table as a feature map and utilizes a region consistency to enhance the quality of those generated pseudo labels. Additionally, considering the existence of the domain gap, a cross-domain consistency based on Maximum Mean Discrepancy is designed to alleviate domain shift problems. Our method achieves state-of-the-art performance with minimal parameters and computational costs, making it a strong baseline for cross-domain ASTE.
- North America > United States > Idaho > Ada County > Boise (0.05)
- Asia > China > Beijing > Beijing (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- (5 more...)
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.34)